Project statement

What is the purpose of your analysis? What effect are you investigating? Why?

The purpose of my analysis is to determine what movies I will enjoy the most. I will do this by creating a model with the response variable of how much I liked a movie out of 100 and predicting it with data on various movie reviews, genres, cast and crew, and more. I am doing this so that I can figure out which movies I will want to watch because I can be confident that I will enjoy them.

Data description

How were the data collected? How do these data help you answer the relevant research questions?

I collected the data myself over time. A partial dataset in JSON form is available on my website: https://www.tradethisandthat.com/movies/api/all_movies/. The references include the Python code I wrote to export the MySQL database to a CSV. Data from TMDB was collected through their open API. Data from IMDB was collected by web scraping two pages: the awards page (to find any Oscars) and the rating distribution page. Most of the data is gathered by Python code that runs in Django whenever I add a movie. I record my own rating and the Metacritic rating myself.

Exploratory data analysis

Data analysis

What is the statistical model (or models) that you are using? Why is this an appropriate model to use? Do the model diagnostics contradict the model assumptions? How do you interpret the results from the statistical analysis in the context of your research question?

I tried fitting many different models, including SLRs, MLRs, polynomials, monotone transformations, ridge, lasso, and GAMs. Ultimately, the model that I ended up liking

Models

SLR

# begin with some simple models
fit1 = lm(my_rating~imdb_rating, data=continuous_movies)
summary(fit1)
## 
## Call:
## lm(formula = my_rating ~ imdb_rating, data = continuous_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.975  -8.660   2.319  10.025  50.625 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.1699     4.4286  -8.167 1.96e-15 ***
## imdb_rating  12.9445     0.6136  21.095  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.13 on 584 degrees of freedom
## Multiple R-squared:  0.4324, Adjusted R-squared:  0.4315 
## F-statistic:   445 on 1 and 584 DF,  p-value: < 2.2e-16
fit2 = lm(my_rating~metacritic_rating, data=continuous_movies)
summary(fit2)
## 
## Call:
## lm(formula = my_rating ~ metacritic_rating, data = continuous_movies)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52.37 -10.01   1.52  11.83  38.26 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       15.80834    2.91499   5.423 8.59e-08 ***
## metacritic_rating  0.61262    0.04279  14.316  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.14 on 584 degrees of freedom
## Multiple R-squared:  0.2598, Adjusted R-squared:  0.2585 
## F-statistic:   205 on 1 and 584 DF,  p-value: < 2.2e-16
fit3 = lm(my_rating~tmdb_rating, data=continuous_movies)
summary(fit3)
## 
## Call:
## lm(formula = my_rating ~ tmdb_rating, data = continuous_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.043  -8.252   2.349  10.104  40.173 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -51.0368     5.7467  -8.881   <2e-16 ***
## tmdb_rating  15.1543     0.8057  18.808   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.8 on 584 degrees of freedom
## Multiple R-squared:  0.3772, Adjusted R-squared:  0.3762 
## F-statistic: 353.7 on 1 and 584 DF,  p-value: < 2.2e-16
plot(continuous_movies$my_rating, fit1$fitted.values,xlim=c(0,100),ylim=c(0,100),xlab="My Rating",ylab="Fitted Values",main="My Rating vs Fitted Values",col="orchid")
points(continuous_movies$my_rating,fit2$fitted.values,col="aquamarine3")
points(continuous_movies$my_rating,fit3$fitted.values,col="firebrick4")
legend(x = "topleft", title="Models", bg="transparent",
       legend=c("IMDB", "Metacritic", "TMDB"), # label order matches the colors of fit1, fit2, fit3
       fill = c("orchid","aquamarine3",'firebrick4'))
abline(0,1)
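
The three summaries above are easier to compare side by side. The following is a minimal sketch of collecting the slope, adjusted R-squared, and residual standard error of several SLRs into one table. It runs on simulated data: sim_movies and its coefficients are hypothetical stand-ins for continuous_movies, not the real ratings.

```r
# Simulated stand-in for continuous_movies (hypothetical data, not the real ratings)
set.seed(1)
n <- 200
sim_movies <- data.frame(imdb_rating = runif(n, 4, 9))
sim_movies$tmdb_rating <- sim_movies$imdb_rating + rnorm(n, sd = 0.5)
sim_movies$metacritic_rating <- 10 * sim_movies$imdb_rating + rnorm(n, sd = 12)
sim_movies$my_rating <- -35 + 13 * sim_movies$imdb_rating + rnorm(n, sd = 14)

# Fit one simple linear regression per predictor
predictors <- c("imdb_rating", "metacritic_rating", "tmdb_rating")
slr_fits <- lapply(predictors, function(p) {
  lm(reformulate(p, response = "my_rating"), data = sim_movies)
})
names(slr_fits) <- predictors

# Collect the headline numbers from each summary into one table
slr_table <- data.frame(
  predictor = predictors,
  slope     = sapply(slr_fits, function(f) unname(coef(f)[2])),
  adj_r2    = sapply(slr_fits, function(f) summary(f)$adj.r.squared),
  resid_se  = sapply(slr_fits, function(f) summary(f)$sigma),
  row.names = NULL
)

# Best-fitting predictor first
print(slr_table[order(-slr_table$adj_r2), ])
```

Swapping sim_movies for continuous_movies should produce the same three adjusted R-squared values reported in the summaries above.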

MLR

fit4 = lm(my_rating~imdb_rating+metacritic_rating+tmdb_rating,data=continuous_movies) # create an MLR with just movie ratings from other sources
summary(fit4)
## 
## Call:
## lm(formula = my_rating ~ imdb_rating + metacritic_rating + tmdb_rating, 
##     data = continuous_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.326  -8.704   2.258   9.844  49.695 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -40.48555    5.79723  -6.984 7.88e-12 ***
## imdb_rating        10.53253    1.55661   6.766 3.23e-11 ***
## metacritic_rating   0.03985    0.05674   0.702    0.483    
## tmdb_rating         2.66900    1.78948   1.491    0.136    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.12 on 582 degrees of freedom
## Multiple R-squared:  0.4352, Adjusted R-squared:  0.4323 
## F-statistic: 149.5 on 3 and 582 DF,  p-value: < 2.2e-16
fit5 = lm(my_rating~.-name,data=continuous_movies) # create an MLR with all linear terms
summary(fit5)
## 
## Call:
## lm(formula = my_rating ~ . - name, data = continuous_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.881  -8.154   1.378   8.372  51.355 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                   6.418e+00  2.237e+01   0.287  0.77433   
## imdb_rating                   1.688e+00  6.637e+00   0.254  0.79931   
## tmdb_rating                  -6.263e+00  5.267e+00  -1.189  0.23489   
## tmdb_count                   -9.209e-05  3.765e-04  -0.245  0.80688   
## metacritic_rating            -1.867e-01  1.295e-01  -1.442  0.15002   
## budget                       -1.747e-08  1.589e-08  -1.100  0.27205   
## revenue                      -1.436e-09  2.927e-09  -0.490  0.62401   
## runtime                      -3.679e-02  4.214e-02  -0.873  0.38311   
## award_count                  -4.153e-01  3.042e-01  -1.365  0.17270   
## imdb_count                    3.693e-05  1.623e-05   2.276  0.02323 * 
## imdb_arithmetic_mean         -2.288e+00  3.410e+00  -0.671  0.50251   
## imdb_median                  -1.787e-01  8.727e-01  -0.205  0.83786   
## imdb_top_1000_rating          1.790e+00  1.875e+00   0.955  0.34008   
## imdb_top_1000_count           1.889e-02  6.372e-03   2.965  0.00316 **
## imdb_us_rating                6.948e+00  4.351e+00   1.597  0.11095   
## imdb_us_count                -1.267e-04  5.932e-05  -2.136  0.03315 * 
## imdb_not_us_rating            4.190e+00  4.777e+00   0.877  0.38082   
## imdb_not_us_count            -4.886e-05  3.790e-05  -1.289  0.19786   
## mpaa_name                     2.612e-01  6.276e-01   0.416  0.67745   
## imdb_rating_percentile       -1.564e-01  1.141e-01  -1.371  0.17095   
## tmdb_rating_percentile        2.987e-01  1.475e-01   2.025  0.04334 * 
## metacritic_rating_percentile  1.090e-01  6.191e-02   1.760  0.07899 . 
## is_action                     2.840e+00  1.642e+00   1.730  0.08420 . 
## is_comedy                     1.091e+00  1.580e+00   0.690  0.49025   
## is_adventure                 -2.751e+00  1.534e+00  -1.793  0.07351 . 
## is_animation                  1.431e+00  2.656e+00   0.539  0.59023   
## is_family                     3.269e+00  2.376e+00   1.376  0.16945   
## is_drama                      4.133e-01  1.761e+00   0.235  0.81450   
## is_scifi                      1.322e+00  1.673e+00   0.790  0.42970   
## is_thriller                   8.779e-01  1.785e+00   0.492  0.62305   
## is_brad_pitt                  4.318e+00  4.034e+00   1.070  0.28491   
## is_stan_lee                   3.593e+00  3.048e+00   1.179  0.23899   
## is_christopher_nolan          5.361e+00  6.398e+00   0.838  0.40253   
## is_spielberg                 -1.090e+00  3.446e+00  -0.316  0.75200   
## is_harrison_ford              3.944e+00  5.412e+00   0.729  0.46650   
## is_matt_damon                 5.681e+00  4.157e+00   1.367  0.17235   
## is_wes_anderson               7.329e+00  6.880e+00   1.065  0.28724   
## is_tom_cruise                 1.933e+00  4.323e+00   0.447  0.65499   
## is_john_williams              1.013e+01  4.516e+00   2.243  0.02531 * 
## is_rdj                       -5.650e-01  4.649e+00  -0.122  0.90333   
## is_quentin_tarantino         -5.106e+00  5.865e+00  -0.870  0.38443   
## is_tom_hanks                  2.609e+00  4.119e+00   0.633  0.52670   
## is_george_lucas               1.781e+00  5.621e+00   0.317  0.75146   
## is_leonardo_dicaprio          1.055e+00  5.595e+00   0.188  0.85057   
## is_the_rock                   3.416e+00  3.962e+00   0.862  0.38900   
## is_stanley_kubrick            1.951e+00  6.408e+00   0.304  0.76095   
## is_john_hughes                2.951e+00  5.159e+00   0.572  0.56758   
## is_jim_carrey                 7.809e+00  5.203e+00   1.501  0.13396   
## is_wally_pfister              5.665e+00  6.792e+00   0.834  0.40461   
## is_henry_fonda                1.756e+01  1.366e+01   1.286  0.19899   
## is_morgan_freeman            -2.065e+00  4.808e+00  -0.430  0.66770   
## is_bong_joon_ho               5.757e+00  8.088e+00   0.712  0.47690   
## is_dustin_hoffman             9.500e+00  5.612e+00   1.693  0.09108 . 
## is_arnold_schwarz            -5.685e+00  8.090e+00  -0.703  0.48252   
## is_jack_nicholson            -1.476e+01  8.111e+00  -1.820  0.06928 . 
## is_aamir_khan                 2.371e+01  1.403e+01   1.691  0.09151 . 
## is_sean_connery               1.366e+01  8.103e+00   1.685  0.09250 . 
## is_brad_bird                  7.857e+00  6.328e+00   1.242  0.21494   
## is_natalie_portman            2.629e+00  5.006e+00   0.525  0.59976   
## is_robin_williams             1.037e+01  5.239e+00   1.980  0.04826 * 
## is_sandra_bullock             3.728e+00  5.572e+00   0.669  0.50377   
## is_bill_murray                5.583e+00  4.935e+00   1.131  0.25844   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.25 on 524 degrees of freedom
## Multiple R-squared:  0.5526, Adjusted R-squared:  0.5005 
## F-statistic: 10.61 on 61 and 524 DF,  p-value: < 2.2e-16
fit6 = lm(my_rating~imdb_count+imdb_top_1000_count+imdb_us_count+imdb_not_us_count,data=continuous_movies) # create an MLR with just movie rating counts
summary(fit6)
## 
## Call:
## lm(formula = my_rating ~ imdb_count + imdb_top_1000_count + imdb_us_count + 
##     imdb_not_us_count, data = continuous_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.720  -9.889   1.807  11.038  43.118 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.725e+01  2.108e+00  17.675  < 2e-16 ***
## imdb_count           4.757e-05  1.162e-05   4.095 4.82e-05 ***
## imdb_top_1000_count  2.468e-02  4.845e-03   5.093 4.76e-07 ***
## imdb_us_count       -1.000e-05  4.480e-05  -0.223   0.8234    
## imdb_not_us_count   -8.532e-05  3.331e-05  -2.561   0.0107 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.8 on 581 degrees of freedom
## Multiple R-squared:  0.2944, Adjusted R-squared:  0.2895 
## F-statistic: 60.59 on 4 and 581 DF,  p-value: < 2.2e-16
plot(continuous_movies$my_rating, fit4$fitted.values,xlim=c(0,100),ylim=c(0,100),xlab="My Rating",ylab="Fitted Values",main="My Rating vs Fitted Values",col="orchid")
points(continuous_movies$my_rating,fit5$fitted.values,col="aquamarine3")
points(continuous_movies$my_rating,fit6$fitted.values,col="firebrick4")
legend(x = "topleft", title="Models", bg="transparent",
       legend=c("IMDB+TMDB+Metacritic", "All",'All Counts'),
       fill = c("orchid","aquamarine3",'firebrick4'))
abline(0,1)
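
In the output above, fit4 barely improves on the IMDB-only SLR (adjusted R-squared of 0.4323 against 0.4315). One way to check whether extra terms earn their place is a nested-model F-test with anova(), the same function used later for the null and full fits. This is a sketch on simulated data; sim_movies, small_fit, and big_fit are hypothetical names.

```r
# Simulated ratings where only imdb_rating truly drives my_rating (hypothetical data)
set.seed(2)
n <- 300
imdb <- runif(n, 4, 9)
sim_movies <- data.frame(
  imdb_rating       = imdb,
  tmdb_rating       = imdb + rnorm(n, sd = 0.4),
  metacritic_rating = 10 * imdb + rnorm(n, sd = 10)
)
sim_movies$my_rating <- -35 + 13 * imdb + rnorm(n, sd = 14)

# The small model is nested inside the big one
small_fit <- lm(my_rating ~ imdb_rating, data = sim_movies)
big_fit   <- lm(my_rating ~ imdb_rating + metacritic_rating + tmdb_rating,
                data = sim_movies)

# F-test of the null hypothesis that the two extra coefficients are both zero
nested_test <- anova(small_fit, big_fit, test = "F")
print(nested_test)

# p-value for the added terms (second row of the table)
extra_terms_p <- nested_test[["Pr(>F)"]][2]
extra_terms_p
```

A large p-value here would mean the two extra ratings add little once imdb_rating is in the model.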

Transformed and Polynomial MLRs

fit7 = lm(sqrt(my_rating)~.-name,data=continuous_movies) # MLR with full model to sqrt of my rating
summary(fit7)
## 
## Call:
## lm(formula = sqrt(my_rating) ~ . - name, data = continuous_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8407 -0.5095  0.1454  0.6056  3.9915 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                   3.353e+00  1.695e+00   1.978  0.04841 * 
## imdb_rating                   5.560e-01  5.028e-01   1.106  0.26928   
## tmdb_rating                  -7.435e-01  3.990e-01  -1.864  0.06294 . 
## tmdb_count                   -1.024e-05  2.853e-05  -0.359  0.71973   
## metacritic_rating            -1.458e-02  9.813e-03  -1.486  0.13792   
## budget                       -1.186e-09  1.204e-09  -0.986  0.32482   
## revenue                      -4.278e-11  2.218e-10  -0.193  0.84709   
## runtime                      -2.128e-03  3.192e-03  -0.667  0.50535   
## award_count                  -3.046e-02  2.304e-02  -1.322  0.18675   
## imdb_count                    2.406e-06  1.229e-06   1.958  0.05078 . 
## imdb_arithmetic_mean         -2.659e-01  2.583e-01  -1.029  0.30379   
## imdb_median                  -4.245e-03  6.611e-02  -0.064  0.94883   
## imdb_top_1000_rating          1.327e-01  1.420e-01   0.935  0.35045   
## imdb_top_1000_count           1.531e-03  4.827e-04   3.171  0.00161 **
## imdb_us_rating                5.424e-01  3.296e-01   1.645  0.10047   
## imdb_us_count                -9.192e-06  4.493e-06  -2.046  0.04128 * 
## imdb_not_us_rating            2.895e-01  3.619e-01   0.800  0.42405   
## imdb_not_us_count            -3.273e-06  2.871e-06  -1.140  0.25473   
## mpaa_name                     1.894e-02  4.754e-02   0.398  0.69046   
## imdb_rating_percentile       -2.235e-02  8.641e-03  -2.587  0.00995 **
## tmdb_rating_percentile        2.926e-02  1.117e-02   2.618  0.00909 **
## metacritic_rating_percentile  7.776e-03  4.690e-03   1.658  0.09792 . 
## is_action                     2.064e-01  1.244e-01   1.659  0.09765 . 
## is_comedy                     5.186e-02  1.197e-01   0.433  0.66505   
## is_adventure                 -2.185e-01  1.162e-01  -1.880  0.06067 . 
## is_animation                  1.211e-01  2.012e-01   0.602  0.54760   
## is_family                     2.445e-01  1.800e-01   1.358  0.17493   
## is_drama                      1.508e-02  1.334e-01   0.113  0.91000   
## is_scifi                      9.787e-02  1.267e-01   0.772  0.44025   
## is_thriller                   7.312e-02  1.352e-01   0.541  0.58890   
## is_brad_pitt                  3.188e-01  3.056e-01   1.043  0.29732   
## is_stan_lee                   2.609e-01  2.309e-01   1.130  0.25900   
## is_christopher_nolan          4.147e-01  4.847e-01   0.856  0.39259   
## is_spielberg                 -8.074e-02  2.610e-01  -0.309  0.75723   
## is_harrison_ford              2.389e-01  4.100e-01   0.583  0.56025   
## is_matt_damon                 3.650e-01  3.149e-01   1.159  0.24705   
## is_wes_anderson               5.662e-01  5.212e-01   1.086  0.27779   
## is_tom_cruise                 1.565e-01  3.275e-01   0.478  0.63299   
## is_john_williams              6.487e-01  3.421e-01   1.896  0.05845 . 
## is_rdj                       -1.184e-02  3.522e-01  -0.034  0.97321   
## is_quentin_tarantino         -3.294e-01  4.443e-01  -0.741  0.45875   
## is_tom_hanks                  1.787e-01  3.120e-01   0.573  0.56716   
## is_george_lucas               1.235e-01  4.258e-01   0.290  0.77186   
## is_leonardo_dicaprio          1.099e-01  4.239e-01   0.259  0.79548   
## is_the_rock                   2.350e-01  3.002e-01   0.783  0.43400   
## is_stanley_kubrick            1.301e-01  4.854e-01   0.268  0.78877   
## is_john_hughes                2.323e-01  3.908e-01   0.594  0.55251   
## is_jim_carrey                 5.674e-01  3.941e-01   1.440  0.15058   
## is_wally_pfister              3.609e-01  5.145e-01   0.701  0.48333   
## is_henry_fonda                7.873e-01  1.034e+00   0.761  0.44697   
## is_morgan_freeman            -2.518e-01  3.642e-01  -0.691  0.48968   
## is_bong_joon_ho               3.681e-01  6.127e-01   0.601  0.54825   
## is_dustin_hoffman             6.196e-01  4.251e-01   1.457  0.14561   
## is_arnold_schwarz            -4.388e-01  6.128e-01  -0.716  0.47431   
## is_jack_nicholson            -9.427e-01  6.144e-01  -1.534  0.12556   
## is_aamir_khan                 1.469e+00  1.063e+00   1.382  0.16749   
## is_sean_connery               8.199e-01  6.138e-01   1.336  0.18221   
## is_brad_bird                  5.077e-01  4.794e-01   1.059  0.29003   
## is_natalie_portman            1.896e-01  3.792e-01   0.500  0.61731   
## is_robin_williams             7.056e-01  3.968e-01   1.778  0.07599 . 
## is_sandra_bullock             2.445e-01  4.221e-01   0.579  0.56269   
## is_bill_murray                2.808e-01  3.739e-01   0.751  0.45292   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.003 on 524 degrees of freedom
## Multiple R-squared:  0.5311, Adjusted R-squared:  0.4766 
## F-statistic: 9.731 on 61 and 524 DF,  p-value: < 2.2e-16
fit8 = lm(my_rating~log(imdb_count)+imdb_rating,data=continuous_movies) # simple model with my_rating to imdb_rating and log(imdb_count)
summary(fit8)
## 
## Call:
## lm(formula = my_rating ~ log(imdb_count) + imdb_rating, data = continuous_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.555  -8.591   1.701   9.537  51.826 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -55.8570     5.3204 -10.499  < 2e-16 ***
## log(imdb_count)   2.7418     0.4381   6.258 7.57e-10 ***
## imdb_rating      11.0089     0.6702  16.427  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.69 on 583 degrees of freedom
## Multiple R-squared:  0.4682, Adjusted R-squared:  0.4663 
## F-statistic: 256.6 on 2 and 583 DF,  p-value: < 2.2e-16
fit9 = lm(my_rating~imdb_rating+I(imdb_rating^2)+I(imdb_rating^3),data=continuous_movies) # fitting a cubic model with imdb_rating
summary(fit9)
## 
## Call:
## lm(formula = my_rating ~ imdb_rating + I(imdb_rating^2) + I(imdb_rating^3), 
##     data = continuous_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.486  -8.483   2.207   9.741  48.700 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)       48.9929    71.1087   0.689    0.491
## imdb_rating      -24.2463    33.8272  -0.717    0.474
## I(imdb_rating^2)   5.2690     5.2629   1.001    0.317
## I(imdb_rating^3)  -0.2430     0.2679  -0.907    0.365
## 
## Residual standard error: 14.13 on 582 degrees of freedom
## Multiple R-squared:  0.4347, Adjusted R-squared:  0.4318 
## F-statistic: 149.2 on 3 and 582 DF,  p-value: < 2.2e-16
fit10 = lm(sqrt(my_rating)~sqrt(imdb_count)+sqrt(imdb_rating), data=continuous_movies) # fitting with sqrts
summary(fit10)
## 
## Call:
## lm(formula = sqrt(my_rating) ~ sqrt(imdb_count) + sqrt(imdb_rating), 
##     data = continuous_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9775 -0.5516  0.1587  0.6845  3.7046 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -3.8029151  0.7059644  -5.387 1.04e-07 ***
## sqrt(imdb_count)   0.0009896  0.0001805   5.484 6.20e-08 ***
## sqrt(imdb_rating)  3.9885846  0.2842290  14.033  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.026 on 583 degrees of freedom
## Multiple R-squared:  0.4544, Adjusted R-squared:  0.4525 
## F-statistic: 242.8 on 2 and 583 DF,  p-value: < 2.2e-16
fit11 = lm(my_rating ~ polym(imdb_rating, imdb_count,award_count, degree=5, raw=TRUE),data=continuous_movies) # fitting with a very big polynomial
# summary(fit11)

plot(continuous_movies$my_rating, fit7$fitted.values^2,xlim=c(0,100),ylim=c(0,100),xlab="My Rating",ylab="Fitted Values",main="My Rating vs Fitted Values",col="orchid")
points(continuous_movies$my_rating,fit8$fitted.values,col="aquamarine3")
points(continuous_movies$my_rating,fit9$fitted.values,col="firebrick4")
points(continuous_movies$my_rating,fit10$fitted.values^2,col="navajowhite")
points(continuous_movies$my_rating,fit11$fitted.values,col="royalblue") # fit11 models my_rating directly, so no back-transform
legend(x = "topleft", title="Models", bg="transparent",
       legend=c("SQRT~All", "IMDB+log(imdb_count)",'IMDB^3','SQRT~SQRT','polym'),
       fill = c("orchid","aquamarine3",'firebrick4','navajowhite','royalblue'))
abline(0,1)
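
The plot squares the fitted values of the sqrt models so that every model is shown on the original 0 to 100 scale. The R-squared values in the summaries are not comparable in the same way, because the sqrt models report fit on the square-root scale. A minimal sketch of putting both kinds of model on the original scale follows. It uses simulated data, and sim and r2_original_scale are hypothetical names.

```r
# Simulated ratings kept inside 1..100 so sqrt() is defined (hypothetical data)
set.seed(3)
n <- 250
sim <- data.frame(imdb_rating = runif(n, 4, 9))
sim$my_rating <- pmin(100, pmax(1, -35 + 13 * sim$imdb_rating + rnorm(n, sd = 12)))

fit_raw  <- lm(my_rating ~ imdb_rating, data = sim)
fit_sqrt <- lm(sqrt(my_rating) ~ sqrt(imdb_rating), data = sim)

# R-squared computed against the untransformed response
r2_original_scale <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}

pred_raw  <- fitted(fit_raw)
pred_sqrt <- fitted(fit_sqrt)^2  # undo the square root before comparing

r2_vals <- c(
  raw        = r2_original_scale(sim$my_rating, pred_raw),
  sqrt_model = r2_original_scale(sim$my_rating, pred_sqrt)
)
print(r2_vals)
```

Squaring a fitted square root tends to slightly underestimate the mean rating, so the back-transformed predictions are a convenient approximation rather than an exact conditional mean.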

Ridge and Lasso

Create data matrix to be used for both ridge and lasso.

X = model.matrix(~ -1 + sqrt(imdb_count) + imdb_rating + sqrt(award_count), data = continuous_movies) # model to use for ridge and lasso fits
y = continuous_movies$my_rating
fit_lm = lm(my_rating ~ sqrt(imdb_count)+ imdb_rating + sqrt(award_count), data = continuous_movies) # non-ridge and lasso to compare to

Ridge

library(glmnet) # provides glmnet() and cv.glmnet()
fit_ridge = glmnet(X,y,alpha=0) # ridge fit
fit.cv.ridge = cv.glmnet(X,y,alpha=0)
plot(fit.cv.ridge)

Lasso

fit_lasso = glmnet(X,y,alpha=1) # lasso fit
fit.cv.lasso = cv.glmnet(X,y,alpha=1)
plot(fit.cv.lasso)

Comparison

beta_hat_mlr = coef(fit_lm)
beta_hat_ridge = coef(fit.cv.ridge, s = "lambda.1se")
beta_hat_lasso = coef(fit.cv.lasso, s = "lambda.1se")
cbind(beta_hat_mlr, beta_hat_ridge, beta_hat_lasso)
## 4 x 3 sparse Matrix of class "dgCMatrix"
##                   beta_hat_mlr         s1           s1
## (Intercept)       -24.83846628 3.53534323 -8.363625498
## sqrt(imdb_count)    0.01507946 0.01297789  0.009043295
## imdb_rating        10.20844225 6.25614967  8.363468039
## sqrt(award_count)  -0.05764352 1.50753433  .
plot(continuous_movies$my_rating,predict(fit_ridge,newx=X,s=fit.cv.ridge$lambda.1se),xlim=c(0,100),ylim=c(0,100),xlab="My Rating",ylab="Fitted Values",main="My Rating vs Fitted Values",col="orchid") # reuse fit_ridge instead of refitting
points(continuous_movies$my_rating,predict(fit_lasso,newx=X,s=fit.cv.lasso$lambda.1se),col="aquamarine3") # reuse fit_lasso instead of refitting
points(continuous_movies$my_rating,fit_lm$fitted.values,col="firebrick4")
legend(x = "topleft", title="Models", 
       legend=c("Ridge", "Lasso", "MLR"), 
       fill = c("orchid","aquamarine3",'firebrick4'))
abline(0,1)

  • Adjusted R squared, with ridge and lasso evaluated at lambda.1se (the largest lambda within one standard error of the minimum cross-validated error):
    • ridge: 0.4658365
    • lasso: 0.4671242
    • mlr: 0.4644028

Lasso and ridge give a slight improvement over the MLR because they take advantage of the bias-variance tradeoff, accepting a little bias in exchange for lower variance. Looking at the beta values, the models differ a lot in their coefficients despite getting similar results. The MLR puts larger coefficients on IMDB count and IMDB rating than ridge or lasso do. Lasso removes sqrt(award_count) entirely, with little detriment to the fit.
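
The shrinkage behind the ridge column can be illustrated with the closed-form ridge solution in base R. This is a sketch on simulated data, not a reimplementation of glmnet: glmnet parametrizes the penalty differently, so the lambda values here are not comparable with lambda.1se. The names Xs, yc, and ridge_coef are hypothetical.

```r
# Simulated predictors and response (hypothetical data)
set.seed(4)
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 50 + 8 * X[, "x1"] + 2 * X[, "x2"] + rnorm(n, sd = 10)

# Standardize predictors and center the response so no intercept is penalized
Xs <- scale(X)
yc <- y - mean(y)

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(lambda) {
  p <- ncol(Xs)
  drop(solve(crossprod(Xs) + lambda * diag(p), crossprod(Xs, yc)))
}

# Coefficients for increasing penalties; lambda = 0 is ordinary least squares
lambdas <- c(0, 10, 100, 1000)
coef_path <- sapply(lambdas, ridge_coef)
colnames(coef_path) <- paste0("lambda=", lambdas)
print(round(coef_path, 3))

# Overall coefficient size shrinks as the penalty grows
coef_norms <- sqrt(colSums(coef_path^2))
print(round(coef_norms, 3))
```

Ridge shrinks every coefficient toward zero without removing any, whereas the lasso penalty can set a coefficient exactly to zero, which is what happens to sqrt(award_count) in the table above.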

Steps with AIC and BIC

library(leaps) # procedure from lab
fit_full = lm(my_rating~.-name+log(imdb_count)+log(tmdb_count)+log(imdb_us_count)+log(imdb_not_us_count), data=continuous_movies) # start by creating a full and add some log terms for values not bounded to a specific range
fit_null = lm(my_rating~1, data=continuous_movies) # create a null fit to just an intercept
anova(fit_null,fit_full,test='F') # make sure that the full fit is more significant than the null fit

Because the full model is a significant improvement over the null model (the F-test p-value is significant), it makes sense to continue with AIC and BIC to select a model. AIC and BIC penalize each added parameter, which is needed because adding a parameter never makes the in-sample fit worse.
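
To make the penalty concrete, both criteria can be computed by hand from the log-likelihood: AIC adds 2 per parameter and BIC adds log(n) per parameter. The sketch below uses simulated data with hypothetical names (sim, fit_small, fit_big). It also shows that AIC() with k = log(n) reproduces BIC(), which is the same k argument that step() accepts.

```r
# Simulated data: only x1 matters, x2 and noise are irrelevant (hypothetical data)
set.seed(5)
n <- 150
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 3 + 2 * sim$x1 + rnorm(n)

fit_small <- lm(y ~ x1, data = sim)
fit_big   <- lm(y ~ x1 + x2 + noise, data = sim)

# In-sample R-squared cannot go down when terms are added
c(small = summary(fit_small)$r.squared, big = summary(fit_big)$r.squared)

# Penalized criteria computed by hand from the log-likelihood
ll <- logLik(fit_big)
k_params <- attr(ll, "df")  # regression coefficients plus the error variance
aic_by_hand <- -2 * as.numeric(ll) + 2 * k_params
bic_by_hand <- -2 * as.numeric(ll) + log(n) * k_params

# The built-in functions should agree with the hand computation
c(by_hand = aic_by_hand, builtin = AIC(fit_big))
c(by_hand = bic_by_hand, builtin = BIC(fit_big), aic_with_log_n = AIC(fit_big, k = log(n)))
```

Since log(n) is larger than 2 for any n above 7, BIC charges more per parameter than AIC and tends to choose smaller models.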

fit_aic = step(fit_null,list(upper=fit_full),direction='forward') # run AIC
n = nrow(continuous_movies)
fit_bic = step(fit_null,list(upper=fit_full),direction='forward',k=log(n)) # run BIC
summary(fit_aic)
## 
## Call:
## lm(formula = my_rating ~ imdb_rating + log(imdb_count) + is_john_williams + 
##     tmdb_rating_percentile + is_robin_williams + is_bill_murray + 
##     is_sean_connery + is_jack_nicholson + is_christopher_nolan + 
##     is_stan_lee + is_comedy + is_matt_damon + imdb_us_rating + 
##     is_dustin_hoffman + is_action + is_adventure + is_family + 
##     is_henry_fonda + is_aamir_khan + imdb_not_us_rating, data = continuous_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.208  -8.048   0.963   8.901  51.492 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -36.54690    8.02058  -4.557 6.37e-06 ***
## imdb_rating             -5.03237    5.51677  -0.912 0.362055    
## log(imdb_count)          2.18695    0.45396   4.817 1.87e-06 ***
## is_john_williams        11.04011    2.89969   3.807 0.000156 ***
## tmdb_rating_percentile   0.11880    0.04087   2.907 0.003793 ** 
## is_robin_williams       10.74979    5.07432   2.118 0.034571 *  
## is_bill_murray          10.04111    3.87296   2.593 0.009771 ** 
## is_sean_connery         16.23032    7.74485   2.096 0.036560 *  
## is_jack_nicholson      -14.80868    7.67276  -1.930 0.054103 .  
## is_christopher_nolan     9.18126    4.34181   2.115 0.034900 *  
## is_stan_lee              4.44786    2.38016   1.869 0.062179 .  
## is_comedy                2.03353    1.33940   1.518 0.129514    
## is_matt_damon            7.14619    3.89592   1.834 0.067139 .  
## imdb_us_rating           7.19509    3.22451   2.231 0.026048 *  
## is_dustin_hoffman        7.94873    5.42923   1.464 0.143733    
## is_action                3.48623    1.43060   2.437 0.015122 *  
## is_adventure            -3.04652    1.35713  -2.245 0.025166 *  
## is_family                3.13360    1.53023   2.048 0.041041 *  
## is_henry_fonda          19.83960   13.23803   1.499 0.134515    
## is_aamir_khan           20.41390   13.39043   1.525 0.127940    
## imdb_not_us_rating       5.85220    3.93296   1.488 0.137312    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.15 on 565 degrees of freedom
## Multiple R-squared:  0.5249, Adjusted R-squared:  0.5081 
## F-statistic: 31.21 on 20 and 565 DF,  p-value: < 2.2e-16
summary(fit_bic)
## 
## Call:
## lm(formula = my_rating ~ imdb_rating + log(imdb_count) + is_john_williams, 
##     data = continuous_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.988  -8.279   1.523   9.610  51.665 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -53.7527     5.3313 -10.083  < 2e-16 ***
## imdb_rating       10.9580     0.6659  16.456  < 2e-16 ***
## log(imdb_count)    2.5708     0.4389   5.857 7.89e-09 ***
## is_john_williams   8.5850     2.8724   2.989  0.00292 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.6 on 582 degrees of freedom
## Multiple R-squared:  0.4762, Adjusted R-squared:  0.4735 
## F-statistic: 176.4 on 3 and 582 DF,  p-value: < 2.2e-16
AIC(fit_bic)
## [1] 4727.974
plot(continuous_movies$my_rating, fit_aic$fitted.values,xlim=c(0,100),ylim=c(0,100),xlab="My Rating",ylab="Fitted Values",main="My Rating vs Fitted Values",col="orchid")
points(continuous_movies$my_rating,fit_bic$fitted.values,col="aquamarine3")
legend(x = "topleft", title="Models", 
       legend=c("AIC", "BIC"), 
       fill = c("orchid","aquamarine3"))
abline(0,1)

AIC and BIC gave very different results. The AIC model resulted in my highest adjusted R^2 of any model (0.5080743). It keeps 21 coefficients (20 predictors plus the intercept), which is very different from the BIC model, which keeps 4. Only 3 variables in the AIC model have negative betas: is_jack_nicholson, is_adventure, and imdb_rating. The negative imdb_rating coefficient is unique to this model, and it is possible because many of the people indicators have highly positive coefficients and log(imdb_count) is also highly positive. Moreover, imdb_rating does not have a significant p-value in this model.

The BIC chooses a much sparser model with just imdb_rating, log(imdb_count), and is_john_williams. John Williams is not only an excellent composer but also links many of my favorite movies, such as Indiana Jones and Star Wars, while only appearing in 2 movies I rated below a 60: Close Encounters of the Third Kind and The Lost World: Jurassic Park.

Even with such different models, the AIC and BIC fits had similar adjusted R^2 values, and those values were generally in line with the other models.
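
The gap in model size between the two criteria can be reproduced on simulated data. In this sketch, sim and its eight predictors are hypothetical, and only x1, x2, and (weakly) x3 affect the response. Forward selection is run twice with step(), once with the default AIC penalty and once with k = log(n) for BIC.

```r
# Simulated data with strong, moderate, weak, and irrelevant predictors (hypothetical)
set.seed(6)
n <- 300
sim <- as.data.frame(matrix(rnorm(n * 8), nrow = n))
names(sim) <- paste0("x", 1:8)
sim$y <- 5 + 3 * sim$x1 + 1 * sim$x2 + 0.25 * sim$x3 + rnorm(n, sd = 2)

null_fit <- lm(y ~ 1, data = sim)
upper_formula <- reformulate(paste0("x", 1:8))  # ~ x1 + x2 + ... + x8

# Forward selection from the intercept-only model; trace = 0 hides the step log
sel_aic <- step(null_fit, scope = list(upper = upper_formula),
                direction = "forward", trace = 0)
sel_bic <- step(null_fit, scope = list(upper = upper_formula),
                direction = "forward", trace = 0, k = log(n))

# Compare model size and adjusted R-squared
n_coef <- c(aic = length(coef(sel_aic)), bic = length(coef(sel_bic)))
adj_r2 <- c(aic = summary(sel_aic)$adj.r.squared,
            bic = summary(sel_bic)$adj.r.squared)
print(rbind(n_coef, adj_r2))
```

When every candidate term costs one parameter, both runs add terms in the same order and BIC simply stops earlier, so the BIC model is a subset of the AIC model. That matches the selections above, where the three BIC terms are the first three terms of the AIC model.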

Generalized Additive Models

library(mgcv)

Large GAM

fit_gam1 = gam(my_rating~s(imdb_rating)+s(metacritic_rating)+s(tmdb_rating)+s(imdb_count)+s(award_count)+s(runtime),data=continuous_movies)
summary(fit_gam1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## my_rating ~ s(imdb_rating) + s(metacritic_rating) + s(tmdb_rating) + 
##     s(imdb_count) + s(award_count) + s(runtime)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.4343     0.5606   100.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                        edf Ref.df      F  p-value    
## s(imdb_rating)       1.000  1.000 24.029 1.23e-06 ***
## s(metacritic_rating) 3.051  3.893  1.081    0.318    
## s(tmdb_rating)       2.196  2.822  1.156    0.332    
## s(imdb_count)        3.066  3.836  8.186 3.31e-06 ***
## s(award_count)       1.621  2.015  0.613    0.535    
## s(runtime)           2.471  3.199  1.637    0.195    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.476   Deviance explained = 48.8%
## GCV = 188.82  Scale est. = 184.17    n = 586

GAM with Transformations and Extra Terms

fit_gam2 = gam(my_rating~s(imdb_rating)+s(metacritic_rating)+s(tmdb_rating)+s(sqrt(imdb_count))+s(sqrt(award_count))+s(runtime)+s(budget)+s(imdb_arithmetic_mean),data=continuous_movies)
summary(fit_gam2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## my_rating ~ s(imdb_rating) + s(metacritic_rating) + s(tmdb_rating) + 
##     s(sqrt(imdb_count)) + s(sqrt(award_count)) + s(runtime) + 
##     s(budget) + s(imdb_arithmetic_mean)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.4343     0.5538   101.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                           edf Ref.df      F  p-value    
## s(imdb_rating)          1.000  1.000 10.144  0.00153 ** 
## s(metacritic_rating)    3.146  4.013  1.241  0.29384    
## s(tmdb_rating)          2.612  3.342  1.489  0.18652    
## s(sqrt(imdb_count))     1.000  1.000 19.511 1.19e-05 ***
## s(sqrt(award_count))    2.597  3.143  0.883  0.45300    
## s(runtime)              2.551  3.302  1.410  0.26940    
## s(budget)               4.788  5.851  1.875  0.09210 .  
## s(imdb_arithmetic_mean) 1.000  1.000  0.546  0.46017    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.488   Deviance explained = 50.5%
## GCV =    186  Scale est. = 179.74    n = 586

GAM with Genre Interactions to IMDB Rating and Revenue

fit_gam3 = gam(my_rating~imdb_rating*is_animation+imdb_rating*is_family + imdb_rating*is_adventure + imdb_rating*is_action+imdb_rating*is_drama+imdb_rating*is_comedy+imdb_rating*is_thriller+imdb_rating*is_scifi+s(imdb_count)+imdb_rating*mpaa_name+revenue*is_animation+revenue*is_action+revenue*is_drama+revenue*is_comedy+revenue*is_thriller+revenue*is_scifi,data=continuous_movies)
summary(fit_gam3)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## my_rating ~ imdb_rating * is_animation + imdb_rating * is_family + 
##     imdb_rating * is_adventure + imdb_rating * is_action + imdb_rating * 
##     is_drama + imdb_rating * is_comedy + imdb_rating * is_thriller + 
##     imdb_rating * is_scifi + s(imdb_count) + imdb_rating * mpaa_name + 
##     revenue * is_animation + revenue * is_action + revenue * 
##     is_drama + revenue * is_comedy + revenue * is_thriller + 
##     revenue * is_scifi
## 
## Parametric coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               4.705e-02  1.032e-03  45.577  < 2e-16 ***
## imdb_rating               3.437e-01  7.540e-03  45.577  < 2e-16 ***
## is_animation              4.759e-03  1.044e-04  45.577  < 2e-16 ***
## is_family                 6.716e-03  1.474e-04  45.577  < 2e-16 ***
## is_adventure              9.811e-03  2.153e-04  45.577  < 2e-16 ***
## is_action                 1.447e-02  3.176e-04  45.577  < 2e-16 ***
## is_drama                  1.966e-02  4.313e-04  45.577  < 2e-16 ***
## is_comedy                 1.836e-02  4.028e-04  45.577  < 2e-16 ***
## is_thriller               1.051e-02  2.307e-04  45.577  < 2e-16 ***
## is_scifi                  1.010e-02  2.216e-04  45.577  < 2e-16 ***
## mpaa_name                 2.136e-01  4.686e-03  45.577  < 2e-16 ***
## revenue                   2.050e-08  6.325e-09   3.241  0.00126 ** 
## imdb_rating:is_animation  3.333e-02  7.314e-04  45.577  < 2e-16 ***
## imdb_rating:is_family     4.502e-02  9.878e-04  45.577  < 2e-16 ***
## imdb_rating:is_adventure  6.712e-02  1.473e-03  45.577  < 2e-16 ***
## imdb_rating:is_action     1.017e-01  2.232e-03  45.577  < 2e-16 ***
## imdb_rating:is_drama      1.503e-01  3.297e-03  45.577  < 2e-16 ***
## imdb_rating:is_comedy     1.265e-01  2.776e-03  45.577  < 2e-16 ***
## imdb_rating:is_thriller   7.726e-02  1.695e-03  45.577  < 2e-16 ***
## imdb_rating:is_scifi      7.084e-02  1.554e-03  45.577  < 2e-16 ***
## imdb_rating:mpaa_name     1.561e+00  3.426e-02  45.577  < 2e-16 ***
## is_animation:revenue      1.662e-08  6.052e-09   2.745  0.00623 ** 
## is_action:revenue        -9.210e-09  5.256e-09  -1.752  0.08023 .  
## is_drama:revenue          7.182e-09  5.984e-09   1.200  0.23056    
## is_comedy:revenue        -2.528e-09  5.225e-09  -0.484  0.62866    
## is_thriller:revenue      -5.431e-09  5.659e-09  -0.960  0.33764    
## is_scifi:revenue          6.822e-10  4.677e-09   0.146  0.88408    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                     edf    Ref.df        F p-value    
## s(imdb_count) 6.391e-05 6.391e-05 32503749  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 8/36
## R-sq.(adj) =  -0.067   Deviance explained = -7.02%
## GCV = 385.77  Scale est. = 380.5     n = 586

GAM with BIC Variables

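# GAM using the BIC variables. The `*` operator already expands to both main effects
# plus the interaction, so the separate is_john_williams term is redundant but harmless.
fit_gam4 = gam(
  my_rating ~ s(imdb_rating) + s(imdb_count) + is_john_williams +
    log(imdb_count) * is_john_williams,
  data = continuous_movies
)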
summary(fit_gam4)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## my_rating ~ s(imdb_rating) + s(imdb_count) + is_john_williams + 
##     log(imdb_count) * is_john_williams
## 
## Parametric coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       28.6823     7.4506   3.850 0.000131 ***
## is_john_williams                 -45.0022    75.0277  -0.600 0.548869    
## log(imdb_count)                    2.2408     0.6077   3.687 0.000248 ***
## is_john_williams:log(imdb_count)   4.0007     5.6198   0.712 0.476815    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df      F p-value    
## s(imdb_rating) 2.746   3.52 61.315  <2e-16 ***
## s(imdb_count)  1.000   1.00  0.685   0.408    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.475   Deviance explained = 48.1%
## GCV = 186.91  Scale est. = 184.43    n = 586

Comparison

# overlay the fitted values of all four GAMs against my actual ratings
plot(continuous_movies$my_rating, fit_gam1$fitted.values,
     xlim = c(0, 100), ylim = c(0, 100),
     xlab = "My Rating", ylab = "Fitted Values",
     main = "My Rating vs Fitted Values", col = "orchid")
points(continuous_movies$my_rating, fit_gam2$fitted.values, col = "aquamarine3")
points(continuous_movies$my_rating, fit_gam3$fitted.values, col = "firebrick4")
points(continuous_movies$my_rating, fit_gam4$fitted.values, col = "navajowhite")
legend(x = "topleft", title = "Models", bg = "transparent",
       legend = c("GAM 1", "GAM 2", "GAM 3", "GAM 4"),
       fill = c("orchid", "aquamarine3", "firebrick4", "navajowhite"))
# reference line where the fitted value equals the actual rating
abline(0, 1)

plot(fit_gam1)
plot(fit_gam2)
plot(fit_gam3)
plot(fit_gam4)

The fit plots are omitted to save space. For the first GAM, the smooths for IMDB rating and Metacritic rating are linear. The TMDB rating smooth is linear until the rating reaches about 6, steepens between 6.5 and 7.5, and then flattens. This means that an increase in TMDB rating from 7.5 to 7.6 is expected to have a larger impact on my rating than an increase from 6.4 to 6.5. The IMDB count smooth has a very steep slope until the count reaches around 500,000, after which it flattens out. This nonlinearity is handled by using a log transformation. Finally, the effect of runtime actually changes sign. When runtime is under an hour, an increase in runtime is associated with an expected increase in my rating. However, once runtime passes 150 minutes, a higher runtime is associated with an expected decrease in my rating.
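The effect of a log transformation on a vote count can be illustrated with a few numbers. The sketch below is in Python and uses made-up counts (not values from the dataset). It only shows that equal multiplicative steps in the count become equal additive steps after taking logs, which is why a curve that rises steeply and then flattens on the raw scale can look closer to a straight line on the log scale.

```python
import math

# Hypothetical vote counts for illustration only (not taken from the dataset).
counts = [1_000, 10_000, 100_000, 1_000_000]


def successive_gaps(values):
    """Return the differences between neighbouring values."""
    return [b - a for a, b in zip(values, values[1:])]


# On the raw scale each gap is ten times larger than the previous one.
raw_gaps = successive_gaps(counts)

# On the log10 scale every tenfold step has the same size.
log_gaps = successive_gaps([math.log10(c) for c in counts])

print(raw_gaps)
print([round(g, 6) for g in log_gaps])
```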

For the second GAM, all terms appear linear except for TMDB rating, runtime, and budget. TMDB rating and runtime have already been discussed. The IMDB arithmetic mean is a new addition to this model and actually has a negative slope. This highlights the slight difference between the public IMDB rating and the arithmetic mean of the votes. The public IMDB rating is a weighted average calculated with an undisclosed formula intended to prevent review bombing. The result suggests that the weighted average used by IMDB is a more useful predictor than the arithmetic mean.
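To make the distinction between the two averages concrete, the sketch below computes an arithmetic mean from a 1 to 10 vote histogram and compares it with a weighted mean that down-weights the extreme scores. IMDB does not publish its formula, so the weights here are a purely hypothetical stand-in and the vote counts are invented. The point is only that the two averages can differ for the same set of votes.

```python
def arithmetic_mean(votes):
    """Plain mean of a histogram mapping score (1-10) -> number of votes."""
    total = sum(votes.values())
    return sum(score * n for score, n in votes.items()) / total


def weighted_mean(votes, weights):
    """Mean in which each score's votes are scaled by weights[score]."""
    total = sum(weights[s] * n for s, n in votes.items())
    return sum(s * weights[s] * n for s, n in votes.items()) / total


# Invented histogram with a large cluster of 1-star votes (a "review bomb").
votes = {1: 300, 2: 10, 3: 10, 4: 20, 5: 40, 6: 80, 7: 200, 8: 220, 9: 80, 10: 40}

# Hypothetical weights (NOT IMDB's): votes of 1 and 10 count half as much.
weights = {s: (0.5 if s in (1, 10) else 1.0) for s in range(1, 11)}

plain = arithmetic_mean(votes)
weighted = weighted_mean(votes, weights)
print(round(plain, 2), round(weighted, 2))
```

In this invented example the weighted mean is higher than the arithmetic mean because the cluster of 1-star votes counts for less.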

For budget, there is a quick slope upward at the beginning, suggesting that a percentage increase in budget is more important than a dollar-amount increase. However, this should be taken with a grain of salt, as there are flaws in using budget as a predictor. First, the movies were made at many different times and budgets are not adjusted for inflation, which harms the predictive power of budget. Moreover, some movies do not have public budget information, and this is related to their popularity. The mean IMDB count for movies with a nonzero budget is 4.3154913 × 10^5, while for movies with a budget of 0 the mean count is 3.8796132 × 10^4, more than 10 times smaller.
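The comparison of mean IMDB counts can be checked with a short script. This is a sketch in Python rather than the R used for the models. It assumes rows shaped like the `budget` and `imdb_count` columns of movie.csv, treats a budget of 0 as missing budget information, and uses a few invented rows in place of the real file. The function name `mean_count_by_budget` is hypothetical.

```python
import csv
import io
from statistics import mean


def mean_count_by_budget(rows):
    """Return (mean imdb_count where budget != 0, mean imdb_count where budget == 0)."""
    with_budget, without_budget = [], []
    for row in rows:
        count = float(row["imdb_count"])
        if float(row["budget"]) != 0:
            with_budget.append(count)
        else:
            without_budget.append(count)
    return mean(with_budget), mean(without_budget)


# Invented sample standing in for movie.csv (values are not from the dataset).
sample = io.StringIO(
    "name,budget,imdb_count\n"
    "A,150000000,600000\n"
    "B,20000000,200000\n"
    "C,0,30000\n"
    "D,0,50000\n"
)
known, unknown = mean_count_by_budget(csv.DictReader(sample))
print(known, unknown, known / unknown)
```

With the real export, the same function could be given a `csv.DictReader` opened on the movie file. Rows with blank `budget` or `imdb_count` cells would need to be skipped first, since `float('')` raises an error.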

Summary and discussion

What are the main conclusions of your analysis? What are the main limitations? What might be investigated in future research?

The main conclusions of my analysis

References

What are the sources for the data and any additional statistical resources that you used to support your analysis?

import csv
import os

from django.core.management.base import BaseCommand

from movies.models import Credit, Movie, Person, Place, Viewing, WatchInfo


class Command(BaseCommand):
    help = 'export the full database to csv files'

    def handle(self, *args, **options):
        FOLDER = os.path.dirname(os.path.abspath(__file__))
        with open(os.path.join(FOLDER, './csv/movie.csv'), 'w',
                  encoding='UTF8', newline='') as csvfile:
            filewriter = csv.writer(csvfile)
            filewriter.writerow(
                ['uuid', 'franchise_id', 'mpaa_id', 'imdb_rating', 'metacritic_rating', 'my_rating', 'tmdb_id',
                 'imdb_id', 'name', 'tmdb_rating', 'tmdb_count', 'poster', 'runtime', 'release_date', 'recent_watch',
                 'viewing_count', 'release_day', 'release_month', 'release_year', 'revenue', 'budget',
                 'distance_from_rating_average', 'my_rating_percentile', 'imdb_rating_percentile',
                 'tmdb_rating_percentile', 'metacritic_rating_percentile', 'genre_ids', 'award_ids', 'award_count',
                 'production_company_ids', 'franchise_name', 'mpaa_name', 'genres_names', 'genre_numbers',
                 'award_names', 'production_company_names', 'imdb_count', 'imdb_arithmetic_mean', 'imdb_median',
                 'imdb_top_1000_rating', 'imdb_top_1000_count', 'imdb_us_rating', 'imdb_us_count', 'imdb_not_us_rating',
                 'imdb_not_us_count']
            )
            genres = WatchInfo.objects.filter(watch_info_type='Genre')
            genre_dict = {genre: i for i, genre in enumerate(genres)}
            for movie in Movie.objects.all():
                filewriter.writerow(
                    [movie.uuid, (movie.franchise_id if movie.franchise is not None else ''),
                     (movie.mpaa_rating_id if movie.mpaa_rating is not None else ''), movie.imdb_rating,
                     movie.metacritic_rating, movie.my_rating_field, movie.tmdb_id, movie.imdb_id, movie.name,
                     movie.tmdb_rating, movie.tmdb_count, movie.poster, movie.runtime, movie.release_date,
                     movie.recent_watch, movie.viewing_count, movie.release_day.name, movie.release_month.name,
                     movie.release_year.name, movie.revenue, movie.budget, movie.distance_from_rating_average,
                     movie.rating_percentile, movie.imdb_rating_percentile, movie.tmdb_rating_percentile,
                     movie.metacritic_rating_percentile,
                     [x.pk for x in movie.genres.all()], [x.pk for x in movie.awards.all()], movie.award_count,
                     [x.pk for x in movie.production_companies.all()],
                     (movie.franchise.name if movie.franchise is not None else ''),
                     (movie.mpaa_rating.name if movie.mpaa_rating is not None else ''),
                     [x.name for x in movie.genres.all()], [genre_dict[x] for x in movie.genres.all()],
                     [x.name for x in movie.awards.all()],
                     [x.name for x in movie.production_companies.all()], movie.imdb_count, movie.imdb_arithmetic_mean,
                     movie.imdb_median, movie.imdb_top_1000_rating, movie.imdb_top_1000_count, movie.imdb_us_rating,
                     movie.imdb_us_count, movie.imdb_not_us_rating, movie.imdb_not_us_count
                     ]
                )
        with open(os.path.join(FOLDER, './csv/place.csv'), 'w',
                  encoding='UTF8', newline='') as csvfile:
            filewriter = csv.writer(csvfile)
            filewriter.writerow(['uuid', 'city', 'state', 'country'])
            for place in Place.objects.all():
                filewriter.writerow([place.uuid, place.city, place.state, place.country])
        with open(os.path.join(FOLDER, './csv/person.csv'), 'w',
                  encoding='UTF8', newline='') as csvfile:
            filewriter = csv.writer(csvfile)
            filewriter.writerow(
                ['uuid', 'tmdb_id', 'name', 'main_role', 'birth_place_id', 'image', 'number_of_movies', 'total_rating',
                 'credit_weighted_score']
            )
            for person in Person.objects.all():
                filewriter.writerow([person.uuid, person.tmdb_id, person.name, person.main_role,
                                     (person.birth_place_id if person.birth_place is not None else ''), person.image,
                                     person.number_of_movies_field, person.total_rating_field,
                                     person.credit_order_weighted_score_field])
        with open(os.path.join(FOLDER, './csv/watch_info.csv'), 'w',
                  encoding='UTF8', newline='') as csvfile:
            filewriter = csv.writer(csvfile)
            filewriter.writerow(
                ['uuid', 'name', 'type', 'movie_count', 'category_average', ]
            )
            for watch_info in WatchInfo.objects.all():
                filewriter.writerow(
                    [watch_info.uuid, watch_info.name, watch_info.watch_info_type, watch_info.movie_count,
                     watch_info.category_average])
        with open(os.path.join(FOLDER, './csv/viewing.csv'), 'w',
                  encoding='UTF8', newline='') as csvfile:
            filewriter = csv.writer(csvfile)
            filewriter.writerow(
                ['uuid', 'movie_id', 'watch_date', 'watch_location_id', 'watch_device_id', 'watch_platform_id',
                 'viewing_people_ids', 'review', 'my_rating']
            )
            for viewing in Viewing.objects.all():
                filewriter.writerow([viewing.uuid, viewing.movie_id, viewing.watch_date,
                                     (viewing.watch_location_id if viewing.watch_location is not None else ''),
                                     (viewing.watch_device_id if viewing.watch_device is not None else ''),
                                     (viewing.watch_platform_id if viewing.watch_platform is not None else ''),
                                     [x.pk for x in viewing.people_with.all()], viewing.review, viewing.my_rating])
        with open(os.path.join(FOLDER, './csv/credit.csv'), 'w',
                  encoding='UTF8', newline='') as csvfile:
            filewriter = csv.writer(csvfile)
            filewriter.writerow(
                ['uuid', 'person_id', 'movie_id', 'department', 'character_job', 'order', 'order_score',
                 'order_weighted_score']
            )
            for credit in Credit.objects.all():
                filewriter.writerow(
                    [credit.uuid, credit.person_id, credit.movie_id, credit.department, credit.character_job,
                     credit.order, credit.order_score, credit.credit_order_weighted_score])
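
One consequence of the export code above is that list-valued fields such as `genre_ids` and `genres_names` are passed to `csv.writer` as Python lists, so each cell holds the string form of a list. Anything reading the CSV has to parse those cells back. The following is a minimal sketch using only the standard library. The sample row is invented, and `parse_list_cell` is a hypothetical helper name, not part of the project.

```python
import ast
import csv
import io


def parse_list_cell(cell):
    """Turn a cell like "[1, 2]" or "['Drama', 'Comedy']" back into a list."""
    return ast.literal_eval(cell) if cell else []


# Write one invented row the same way the export does (lists passed directly).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "genre_ids", "genres_names"])
writer.writerow(["Example Movie", [3, 7], ["Drama", "Comedy"]])

# Read it back and recover the original lists from their string form.
buffer.seek(0)
row = next(csv.DictReader(buffer))
genre_ids = parse_list_cell(row["genre_ids"])
genre_names = parse_list_cell(row["genres_names"])
print(genre_ids, genre_names)
```

This assumes the list elements are plain integers or strings. If the primary keys are UUID objects, the cells would contain `UUID('...')` entries, which `ast.literal_eval` cannot parse, and a different approach would be needed.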

Code and data

Include the code and data to reproduce your analysis. The code should include clear comments and should run correctly without errors.

Code and data included throughout.